Patent abstract:

Publication number: ES2883750A2
Application number: ES202190052
Filing date: 2020-02-14
Publication date: 2021-12-09
Inventors: Geest Bartholomeus Wilhelmus Damianus Van; Bart Kroon
Applicant: Koninklijke Philips NV
IPC main class:
Patent description:

[0002] Image signal representing a scene
[0004] Field of the invention
[0006] The invention relates to an image signal representing a scene and in particular, but not exclusively, to the generation of an image signal representing a scene and the rendering of images from this image signal as part of a virtual reality application.
[0008] Background of the invention
[0010] The variety and scope of image and video applications have increased considerably in recent years, with the continuous development and introduction of new services and ways of using and consuming video.
[0012] For example, an increasingly popular service is the provision of image sequences in such a way that the viewer can actively and dynamically interact with the system to modify the parameters of the representation. A very attractive feature in many applications is the ability to change the effective viewing position and viewing direction of the viewer, such as, for example, allowing the viewer to move and "look around" the scene being presented.
[0014] This feature may specifically enable a virtual reality experience to be provided to a user. This can allow the user, for example, to move (relatively) freely in a virtual environment and dynamically change their position and where they are looking. Typically, these virtual reality applications are based on a three-dimensional model of the scene, and the model is dynamically evaluated to provide the specific view requested. This approach is well known, for example, in PC and console game applications, such as first-person shooters.
[0016] It is also desirable, particularly for virtual reality applications, that the displayed image be a three-dimensional image. Indeed, to optimize viewer immersion, it is preferred that the user experience the presented scene as a three-dimensional scene. In fact, a virtual reality experience should preferably allow the user to select their own position, the point of view of the camera and the moment in time in relation to the virtual world.
[0018] Virtual reality applications are typically inherently limited in the sense that they are based on a predetermined model of the scene, and usually on an artificial model of a virtual world. It is often desirable to provide a virtual reality experience based on real world capture. However, in many cases this approach is limited or tends to require a virtual model of the real world to be built from real world captures. The virtual reality experience is then generated by evaluating this model.
[0020] However, current approaches tend to be suboptimal, often requiring a lot of computing or communication resources and/or providing a suboptimal user experience with, for example, reduced quality or limited freedom.
[0022] In many systems, such as when specifically based on a real-world scene, an image representation of the scene is provided where the image representation includes images and depth for one or more capture points/viewpoints in the scene. Images plus depth provide a very efficient characterization, in particular, of a real-world scene, where the characterization is not only relatively easy to generate by capturing the real-world scene, but is also well suited to a renderer that synthesizes views for viewpoints other than those captured. For example, a renderer may be arranged to dynamically generate views that match a current local posture of the viewer. For example, it can dynamically determine the viewer's posture and dynamically generate views that match this viewer posture based on the images and, for example, the depth maps provided.
[0024] However, these image representations tend to result in a very high data rate for a given image quality. In order to provide a good capture of the scene and, in particular, to deal with occlusion phenomena, it is desired that the scene be captured from closely spaced capture positions and that these cover a wide range of positions. Consequently, a relatively high number of images is desired. In addition, the capture views of the cameras often overlap, and therefore the image set tends to include a large amount of redundant information. These problems tend to be independent of the specific capture configuration, and specifically of whether linear or, for example, circular capture configurations are used.
[0026] Thus, while many of the conventional image formats and representations can perform well in many applications and services, they tend to be suboptimal in at least some circumstances.
[0028] Therefore, an improved approach to processing and generating an image signal comprising an image representation of a scene would be advantageous. In particular, a system and/or approach that allows for improved performance, increased flexibility, an improved virtual reality experience, reduced data rates, increased efficiency, easier distribution, reduced complexity, easier implementation, reduced storage requirements, improved image quality, improved rendering, an improved user experience, an improved trade-off between image quality and data rate, and/or improved performance and/or operation would be advantageous.
[0030] Summary of the invention
[0032] Consequently, the invention preferably seeks to mitigate, alleviate or eliminate individually or in any combination one or more of the aforementioned disadvantages.
[0034] According to one aspect of the invention, there is provided an apparatus for generating an image signal, the apparatus comprising: a receiver for receiving a plurality of source images representing a scene from different viewing postures; a combined image generator for generating a plurality of combined images from the source images, each combined image being derived from a set of at least two source images of the plurality of source images, each pixel of a combined image representing the scene for a travel posture and the travel postures for each combined image including at least two different positions, a travel posture for a pixel representing a posture for a travel in a viewing direction for the pixel and from a viewing position for the pixel; an evaluator for determining prediction quality measures for elements of the plurality of source images, a prediction quality measure for an element of a first source image being indicative of a difference between pixel values in the first source image for the pixels in the element and predicted pixel values for the pixels in the element, the predicted pixel values being pixel values resulting from the prediction of the pixels in the element from the plurality of combined images; a determiner for determining segments of the source images comprising elements for which the prediction quality measure is indicative of a difference above a threshold; and an image signal generator for generating an image signal comprising image data representing the combined images and image data representing the segments of the source images.
[0036] The invention can provide an improved representation of a scene and can in many embodiments and scenarios provide improved image quality of the rendered images versus the data rate of the image signal. In many embodiments, a more efficient representation of a scene may be provided, for example, by allowing a given quality to be achieved with a reduced data rate. The approach may provide a more flexible and efficient approach to rendering images of a scene and may allow better adaptation, for example, to the properties of the scene.
[0038] The approach may, in many embodiments, employ an image representation of a scene suitable for a flexible, efficient, high-performance application of Virtual Reality (VR). In many embodiments, it may allow or enable a VR application with a substantially improved trade-off between image quality and data rate. In many embodiments, it may allow for improved perceived image quality and/or reduced data rate.
[0040] The approach may be suitable, for example, for video broadcast services that support adaptation to head movement and rotation at the receiving end.
[0042] The source images may specifically be light intensity images with associated depth information, such as depth maps.
[0044] The approach may, in particular, allow the combined images to be optimized for foreground and background information, respectively, with the segments providing additional data where specifically appropriate.
[0046] The image signal generator may be arranged to use more efficient coding of the combined images than that of the segments. However, the segments can normally constitute a relatively small proportion of the combined image data.
[0048] According to an optional feature of the invention, the combined image generator is arranged to generate at least a first combined image of the plurality of combined images by view synthesis of the pixels of the first combined image from the plurality of source images, where each pixel of the first combined image represents the scene for a travel posture and the travel postures for the first combined image comprise at least two different positions.
[0050] This can provide particularly advantageous operation in many embodiments, and can, for example, allow combined images to be generated for viewing postures where they can (usually in combination) provide a particularly advantageous representation of the scene.
[0052] According to an optional feature of the invention, a dot product between a vertical vector and the pixel cross product vectors is non-negative for at least 90% of the pixels in the first combined image, a pixel cross product vector for a pixel being a cross product between a travel direction for the pixel and a vector from a center point for the different viewing postures to a travel position for the pixel.
[0054] This can in many embodiments provide a particularly efficient and advantageous generation of combined images. In particular, it can provide a low-complexity approach to determining a combined image that provides an advantageous representation of the background data by tending to provide a view biased toward a side view.
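By way of illustration only, the following sketch (in Python; the array names, shapes and the choice of the vertical vector are assumptions introduced for this example and are not taken from the embodiments) evaluates this criterion for a candidate combined image by forming, for each pixel, the cross product between its travel direction and the vector from the center point to its travel position, and checking the sign of the dot product with a vertical vector.

    import numpy as np

    def fraction_non_negative(travel_dirs, travel_pos, center, up=(0.0, 0.0, 1.0)):
        """Fraction of pixels whose pixel cross product vector has a non-negative
        dot product with the vertical vector 'up'.

        travel_dirs: (H, W, 3) unit travel direction per pixel.
        travel_pos:  (H, W, 3) travel position (ray origin) per pixel.
        center:      (3,) center point for the different viewing postures.
        """
        offset = travel_pos - np.asarray(center, float)   # center point -> travel position
        cross = np.cross(travel_dirs, offset)              # pixel cross product vectors
        dots = np.einsum('hwc,c->hw', cross, np.asarray(up, float))
        return float(np.mean(dots >= 0.0))

    # The optional feature of [0052] would require this fraction to be at least 0.9;
    # the complementary feature of [0056] requires non-positive dot products for at
    # least 90% of the pixels of the second combined image.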
[0056] According to an optional feature of the invention, the combined image generator is arranged to generate a second combined image of the plurality of combined images by view synthesis of the pixels of the second combined image from the plurality of source images, where each pixel of the second combined image represents the scene for a travel posture and the travel postures for the second combined image comprise at least two different positions; and wherein a dot product between the vertical vector and the pixel cross product vectors is non-positive for at least 90% of the pixels in the second combined image.
[0057] This can in many embodiments provide a particularly efficient and advantageous generation of combined images. In particular, it can provide a low-complexity approach to determining a combined image that provides an advantageous representation of the background data by tending to provide a view biased towards different side views.
[0059] According to an optional feature of the invention, the travel postures of the first combined image are selected to be close to an edge of a region comprising the different viewing postures of the plurality of source images.
[0061] This may provide advantageous operation in many embodiments and may, for example, provide enhanced background information in the image signal, thus facilitating and/or improving view synthesis based on the image signal.
[0063] According to an optional feature of the invention, each of the travel positions of the first combined image is determined to be less than a first distance from an edge of a region comprising the different viewing positions of the plurality of source images, the first distance being no more than 50% of a maximum internal distance between edge points.
[0065] This may provide advantageous operation in many embodiments and may, for example, provide enhanced background information in the image signal, thus facilitating and/or improving view synthesis based on the image signal. In some embodiments, the first distance is no more than 25% or 10% of the maximum internal distance.
[0067] In some embodiments, at least one viewing posture of the combined images is determined to be less than a first distance from an edge of a region comprising the different viewing postures of the plurality of source images, the first distance being no greater than 20%, 10% or even 5% of a maximum distance between two viewing postures of the different viewing postures.
[0069] In some embodiments, at least one viewing posture of the combined images is determined to be at least a minimum distance from a center point of the different viewing postures, the minimum distance being at least 50%, 75%, or even 90% of a distance from the center point to an edge of a region comprising the different viewing postures of the plurality of source images along a line through the center point and the at least one viewing posture.
[0071] According to an optional feature of the invention, for each pixel of a first combined image of the plurality of combined images, the combined image generator is arranged to determine a corresponding pixel in each of the source images for which a corresponding pixel exists, the corresponding pixel being one that represents the same travel direction as the pixel of the first combined image; and to select a pixel value for the pixel of the first combined image as a pixel value of the corresponding pixel in the source image for which the corresponding pixel represents a travel that has a greatest distance from a center point for the different viewing postures, the greatest distance being in a first direction along a first axis perpendicular to a travel direction for the corresponding pixel.
[0073] This can in many embodiments provide a particularly efficient and advantageous generation of combined images. In particular, it can provide a low-complexity approach to determining a combined image that provides an advantageous representation of the background data by tending to provide a view biased toward a side view.
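A minimal sketch of this selection rule is given below (Python; it assumes the corresponding pixels, i.e. pixels with the same travel direction, have already been established for every source image, and the names and data layout are illustrative rather than prescribed by the embodiments).

    import numpy as np

    def select_side_biased_pixels(corr_values, corr_positions, center, first_axis):
        """Build a combined image by picking, per pixel, the corresponding pixel
        whose travel position lies furthest from the center point along the
        first axis (a per-pixel axis perpendicular to the travel direction).

        corr_values:    (N, H, W, C) corresponding pixel values from N source images.
        corr_positions: (N, H, W, 3) travel positions of the corresponding pixels.
        center:         (3,) center point for the different viewing postures.
        first_axis:     (H, W, 3) unit axis per pixel of the combined image.
        """
        # Signed distance of every corresponding travel position along the first axis.
        dist = np.einsum('nhwc,hwc->nhw', corr_positions - np.asarray(center, float), first_axis)
        best = np.argmax(dist, axis=0)                  # source image index per pixel
        rows, cols = np.indices(best.shape)
        combined = corr_values[best, rows, cols]        # (H, W, C) combined image
        return combined, best                           # 'best' can serve as source data, cf. [0091]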
[0075] According to an optional feature of the invention, determining the corresponding pixels comprises resampling each source image to give an image representation representing at least a portion of a surface of a view sphere surrounding the viewing postures and determining the corresponding pixels as pixels that have the same position in the image representation.
[0077] This can provide a particularly efficient and precise determination of the corresponding pixels.
[0079] The surface of the view sphere can be represented, for example, by an equirectangular or cube map. Each pixel of the view sphere may have a travel direction, and resampling a source image may include setting a view sphere pixel value to the source image pixel value for which the travel direction is the same.
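As a hedged illustration of such a resampling (Python; an equirectangular layout and a z-up coordinate convention are assumed here and are not mandated by the embodiments), a travel direction can be mapped to a pixel position on the view sphere surface as follows; corresponding pixels of different source images are then simply those that land on the same (row, column) position.

    import numpy as np

    def direction_to_equirect(directions, width, height):
        """Map unit travel directions to (row, column) positions of an
        equirectangular representation of the view sphere surface.

        directions: (..., 3) unit vectors (x, y, z), with z pointing up.
        """
        x, y, z = directions[..., 0], directions[..., 1], directions[..., 2]
        longitude = np.arctan2(y, x)                         # [-pi, pi]
        latitude = np.arcsin(np.clip(z, -1.0, 1.0))          # [-pi/2, pi/2]
        col = (longitude + np.pi) / (2.0 * np.pi) * (width - 1)
        row = (np.pi / 2.0 - latitude) / np.pi * (height - 1)
        return np.round(row).astype(int), np.round(col).astype(int)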
[0080] In accordance with an optional feature of the invention, for each pixel of a second combined image the combined image generator is arranged to select a pixel value for the pixel in the second combined image as a pixel value of the corresponding pixel in the source image for which the corresponding pixel represents a travel having a greatest distance from the center point in a direction opposite to the first direction.
[0082] This can in many embodiments provide a particularly efficient and advantageous generation of combined images. In particular, it can provide a low-complexity approach to determining a combined image that provides an advantageous representation of the background data by tending to provide a view biased toward a side view. Furthermore, the second combined image may complement the first combined image by providing a side view from an opposite direction, thus combining with the first combined image to provide a particularly advantageous representation of the scene and specifically of the background information.
[0084] According to an optional feature of the invention, for each pixel of a third combined image the combined image generator is arranged to select a pixel value for the pixel in the third combined image as a pixel value of the corresponding pixel in the source image for which the corresponding pixel represents a travel that has a smallest distance from the center point.
[0086] This can in many embodiments provide a particularly efficient and advantageous generation of combined images. The third combined image may complement the first combined image(s) by providing a more frontal view of the scene which may provide an enhanced representation of foreground objects in the scene.
[0088] According to an optional feature of the invention, for each pixel in a fourth combined image the combined image generator is arranged to select a pixel value for the pixel in the fourth combined image as a pixel value of the corresponding pixel in the source image for which the corresponding pixel represents a travel having a greatest distance from the center point in a second direction along a second axis perpendicular to a travel direction for the corresponding pixel, the first axis and the second axis having different directions.
[0089] This can in many embodiments provide a particularly efficient and advantageous generation of combined images, and can provide an improved representation of the scene.
[0091] According to an optional feature of the invention, the combined image generator is arranged to generate source data for the first combined image, the source data being indicative of which of the source images is a source for each pixel of the first combined image; and the image signal generator is arranged to include the source data in the image signal.
[0093] This can provide particularly advantageous operation in many embodiments.
[0095] According to an optional feature of the invention, the image signal generator is arranged to include source viewing posture data in the image signal, the source viewing posture data being indicative of the different viewing postures for the source images.
[0097] This can provide particularly advantageous operation in many embodiments.
[0099] According to one aspect of the invention, there is provided an apparatus for receiving an image signal, the apparatus comprising: a receiver for receiving an image signal, the image signal comprising a plurality of combined images, each combined image representing image data derived from a set of at least two source images from a plurality of source images representing a scene from different viewing postures, each pixel of a combined image representing the scene for a travel posture and the travel postures for each combined image including at least two different positions, a travel posture for a pixel representing a posture for a travel in a viewing direction for the pixel and from a viewing position for the pixel; image data for a set of segments of the plurality of source images, a segment for a first source image comprising at least one pixel of the first source image for which a prediction quality measure for a prediction of the segment from the plurality of combined images is below a threshold; and a processor for processing the image signal.
[0100] According to one aspect of the invention, there is provided a method of generating an image signal, the method comprising: receiving a plurality of source images representing a scene from different viewing postures; generating a plurality of combined images from the source images, each combined image being derived from a set of at least two source images of the plurality of source images, each pixel of a combined image representing the scene for a travel posture, and the travel postures for each combined image including at least two different positions, a travel posture for a pixel representing a posture for a travel in a viewing direction for the pixel and from a viewing position for the pixel; determining prediction quality measures for elements of the plurality of source images, a prediction quality measure for an element of a first source image being indicative of a difference between pixel values in the first source image for the pixels in the element and predicted pixel values for the pixels in the element, the predicted pixel values being pixel values resulting from the prediction of the pixels in the element from the plurality of combined images; determining segments of the source images comprising elements for which the prediction quality measure is indicative of a difference above a threshold; and generating an image signal comprising image data representing the combined images and image data representing the segments of the source images.
[0102] According to one aspect of the invention, there is provided a method of processing an image signal, the method comprising: receiving an image signal, the image signal comprising: a plurality of combined images, each combined image representing image data derived from a set of at least two source images from a plurality of source images representing a scene from different viewing postures, each pixel of a combined image representing the scene for a travel posture and the travel postures for each combined image including at least two different positions, a travel posture for a pixel representing a posture for a travel in a viewing direction for the pixel and from a viewing position for the pixel; image data for a set of segments of the plurality of source images, a segment for a first source image comprising at least one pixel of the first source image for which a prediction quality measure for a prediction of the segment from the plurality of combined images is below a threshold; and processing the image signal.
[0103] According to one aspect of the invention, there is provided an image signal comprising: a plurality of combined images, each combined image representing image data derived from a set of at least two source images from a plurality of source images representing a scene from different viewing postures, each pixel of a combined image representing the scene for a travel posture and the travel postures for each combined image including at least two different positions, a travel posture for a pixel representing a posture for a travel in a viewing direction for the pixel and from a viewing position for the pixel; and image data for a set of segments of the plurality of source images, a segment for a first source image comprising at least one pixel of the first source image for which a prediction quality measure for a prediction of the segment from the plurality of combined images is below a threshold.
[0105] These and other aspects, features and advantages of the invention will become apparent and will be explained with reference to the embodiment(s) described below.
[0107] Brief description of the drawings
[0109] Embodiments of the invention will be described, by way of example only, with reference to the drawings, in which
[0111] Fig. 1 illustrates an example of an arrangement for providing a virtual reality experience;
[0113] Fig. 2 illustrates an example of a scene capture arrangement;
[0115] Fig. 3 illustrates an example of a scene capture arrangement;
[0117] Fig. 4 illustrates an example of elements of an apparatus according to some embodiments of the invention;
[0119] Fig. 5 illustrates an example of elements of an apparatus according to some embodiments of the invention;
[0121] Fig. 6 illustrates an example of pixel selection in accordance with some embodiments of the invention;
[0122] Fig. 7 illustrates an example of pixel selection according to some embodiments of the invention;
[0123] Fig. 8 illustrates an example of elements of a travel posture arrangement for a combined image generated in accordance with some embodiments of the invention;
[0125] Fig. 9 illustrates an example of elements of a travel posture arrangement for a combined image generated in accordance with some embodiments of the invention;
[0127] Fig. 10 illustrates an example of elements of a travel posture arrangement for a combined image generated in accordance with some embodiments of the invention;
[0129] Fig. 11 illustrates an example of elements of a travel posture arrangement for a combined image generated in accordance with some embodiments of the invention;
[0131] Fig. 12 illustrates an example of elements of a travel posture arrangement for a combined image generated in accordance with some embodiments of the invention; and
[0133] Fig. 13 illustrates an example of elements of a travel posture arrangement for a combined image generated in accordance with some embodiments of the invention.
[0135] Detailed description of some embodiments of the invention
[0137] Virtual experiences that allow the user to move in a virtual world are becoming increasingly popular and services are being developed to meet this demand. However, the provision of efficient virtual reality services is very difficult, especially if the experience must be based on a capture of a real-world environment and not on a totally virtual artificial world.
[0139] In many virtual reality applications, a viewer posture input is determined that reflects the posture of a virtual viewer in the scene. The virtual reality device/system/application then generates one or more images corresponding to the views and viewports of the scene for a viewer corresponding to the viewer posture.
[0141] Normally, the virtual reality application generates three-dimensional output in the form of separate images for the left and right eyes. They may be presented to the user by suitable means, such as the individual left and right eye displays of a VR headset. In other embodiments, the image may be presented, for example, on an autostereoscopic display (in which case a larger number of viewing images may be generated for the viewer's posture), or even in some embodiments a single two-dimensional image may be generated (for example, using a conventional two-dimensional screen).
[0143] Viewer posture input can be determined in different ways in different applications. In many embodiments, the physical movement of a user can be directly tracked. For example, a camera surveying the user's area can detect and track the user's head (or even eyes). In many embodiments, the user may wear a VR headset that can be tracked by external and/or internal means. For example, the headset may include accelerometers and gyroscopes that provide information about the movement and rotation of the headset and thus the head. In some examples, the VR headset may transmit signals or include identifiers (eg, visuals) that allow an external sensor to determine the movement of the VR headset.
[0145] In some systems, the viewer's posture may be provided by manual means, for example, the user manually controlling a joystick or similar manual input. For example, the user can manually move the virtual viewer in the scene by controlling a first analog stick with one hand and manually controlling the direction the virtual viewer is facing with the other hand by manually moving a second analog stick.
[0147] In some applications, a combination of manual and automated approaches may be used to generate the input viewer's posture. For example, a headset can track the orientation of the head and the movement/position of the viewer in the scene can be controlled by the user using a joystick.
[0148] Image generation is based on a proper representation of the virtual world/environment/scene. In some applications, a full three-dimensional model can be provided for the scene, and by evaluating this model, views of the scene from a specific "viewer's posture" can be determined.
[0150] In many practical systems, the scene may be represented by an image representation comprising image data. The image data may typically comprise images associated with one or more capture or anchor postures, and may specifically include images for one or more viewports, with each viewport corresponding to a specific posture. An image representation may be used comprising one or more images in which each image represents the view of a given viewport for a given viewing posture. Such viewing postures or positions for which image data is provided are often referred to as anchor postures or positions or capture postures or positions (since the image data may typically correspond to images that are or would be captured by cameras located in the scene with the position and orientation corresponding to the capture posture).
[0152] Based on such an image representation, many typical VR applications can provide view images corresponding to the viewports in the scene for the current viewer posture, with the images dynamically updated to reflect changes in the viewer's posture and with the images generated based on the image data representing the (possibly) virtual scene/environment/world. The application can do this by performing view synthesis and view shifting algorithms, as is known to the skilled person.
[0154] In this field, the terms placement and posture are used as a common term for position and/or direction/orientation. The combination of the position and the direction/orientation of, for example, an object, a camera, a head or a view can be called posture or placement. Thus, a position or posture indication may comprise six values/components/degrees of freedom, with each value/component typically describing a single property of the position/location or orientation/direction of the corresponding object. Of course, in many situations, a placement or posture may be considered or represented with fewer components, for example if one or more components are considered fixed or irrelevant (for example, if all objects are considered to be at the same height and have a horizontal orientation, four components can provide a complete representation of an object's posture). Hereinafter, the term posture is used to refer to a position and/or an orientation that can be represented by one to six values (corresponding to the maximum possible degrees of freedom).
[0156] Many VR applications are based on a posture that has the maximum degrees of freedom, that is, three degrees of freedom from each of the positions and orientation, resulting in a total of six degrees of freedom. Thus, a posture can be represented by a set or vector of six values representing the six degrees of freedom, and thus a posture vector can provide a three-dimensional position and/or three-dimensional direction indication. However, it will be appreciated that in other embodiments, posture may be represented by fewer values.
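For illustration only, a 6-degree-of-freedom posture could be held in a structure such as the following (Python; the yaw/pitch/roll parameterization of the orientation is one possible choice and is not prescribed by the embodiments).

    from dataclasses import dataclass

    @dataclass
    class Posture:
        """A posture/placement: 3D position plus 3D orientation (6 degrees of freedom)."""
        x: float = 0.0
        y: float = 0.0
        z: float = 0.0
        yaw: float = 0.0    # rotation about the vertical axis
        pitch: float = 0.0  # rotation about the lateral axis
        roll: float = 0.0   # rotation about the viewing axis

        def as_vector(self):
            # A posture can thus be represented by a vector of six values.
            return [self.x, self.y, self.z, self.yaw, self.pitch, self.roll]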
[0158] A posture may be at least one of an orientation and a position. A posture value may be indicative of at least one of the orientation and position values.
[0160] A system or entity based on providing the maximum degree of freedom to the viewer is often referred to as having 6 degrees of freedom (6DoF). Many systems and entities only provide an orientation or a position, and these are commonly known as having 3 degrees of freedom (3DoF).
[0162] In some systems, the VR application may be provided locally to a viewer, for example, by a stand-alone device that does not use, or even have access to, any remote VR data or processing. For example, a device such as a game console may comprise a store for storing scene data, an input for receiving/generating the viewer's posture, and a processor for generating corresponding images from the scene data.
[0164] In other systems, the VR application may be implemented and run remotely from the viewer. For example, a user's local device may detect/receive movement/posture data which is transmitted to a remote device which processes the data to generate the viewer's posture. The remote device can then generate viewing images appropriate to the viewer's posture based on scene data describing the scene. The viewing images are then transmitted to the viewer's local device where they are presented. For example, the remote device can directly generate a video stream (typically a stereo/3D video stream) that is presented directly by the local device. Thus, in this example, the local device may not perform any VR processing except transmitting motion data and displaying received video data.
[0166] In many systems, the functionality may be distributed between a local device and a remote device. For example, the local device may process the received input and sensor data to generate viewer postures that are continuously transmitted to the remote VR device. The remote VR device can then generate the corresponding view images and transmit them to the local device for display. In other systems, the remote VR device may not directly generate the view images, but can select the relevant data from the scene and transmit it to the local device, which can then generate the view images that are presented. For example, the remote VR device can identify the closest capture point and extract the corresponding scene data (e.g. spherical image and depth data for the capture point) and transmit it to the local device. The local device can then process the received scene data to generate the images for the specific, current viewing posture. The viewing posture will normally correspond to the head posture, and references to the viewing posture can be considered equivalent to references to the head posture.
[0168] In many applications, especially for broadcast services, a source may transmit scene data in the form of an image (including video) representation of the scene that is independent of the viewer's posture. For example, an image representation for a single sphere of view for a single capture position may be transmitted to a plurality of clients. Individual clients can then locally synthesize vision images corresponding to the current posture of the viewer.
[0170] One application of particular interest is one in which a limited amount of movement is allowed, such that the displayed views update to follow small movements and rotations corresponding to a substantially static viewer making only small head movements and rotations. For example, a seated viewer can turn their head and move it slightly and the views/images presented adapt to follow these changes in posture. This approach can provide a highly immersive video experience. For example, a spectator watching a sporting event may feel that they are present at a specific point in the stadium.
[0171] Limited-freedom applications of this type have the advantage of providing an enhanced experience while not requiring accurate rendering of a scene from many different positions, substantially reducing capture requirements. Similarly, the amount of data that needs to be fed to a renderer can be substantially reduced. In fact, in many scenarios, it is only necessary to provide image and usually depth data for a single viewpoint, and from this the local renderer can generate the desired views.
[0173] The approach may be specifically well suited to applications where data needs to be communicated from a source to a destination over a limited-band communication channel, such as for a broadcast or client-server application.
[0175] Fig. 1 illustrates an exemplary VR system in which a remote VR client device 101 communicates with a VR server 103, for example, over a network 105, such as the Internet. The server 103 may be arranged to simultaneously support a potentially large number of client devices 101.
[0177] VR server 103 may, for example, support a streaming experience by transmitting an image signal comprising an image representation in the form of image data that can be used by the client devices to locally synthesize view images corresponding to the appropriate postures.
[0179] In many applications, such as that of Fig. 1, it may be desirable to capture a scene and generate an efficient image representation that can be efficiently included in an image signal. The image signal can then be transmitted to various devices that can locally synthesize views for viewing postures other than those captured. To this end, the image representation may typically include depth information and, for example, images with associated depth may be provided. For example, depth maps can be obtained using stereoscopic capture in combination with disparity estimation or using range sensors, and these depth maps can be provided with the light intensity images.
[0181] However, a particular problem with these approaches is that changing the viewing posture can change the occlusion characteristics, causing background segments that are not visible in a given captured image to become visible for a different viewing posture.
[0183] To address this, a relatively large number of cameras are often used to capture a scene. Fig. 2 shows an example of capture using an 8-view circular camera rig. In the example, the cameras are facing outwards. As can be seen, different cameras, and thus different capture/source images, can have visibility of different parts of the scene. For example, background region 1 is only visible from camera 2. However, as can also be seen, much of the scene is visible from multiple cameras, creating a significant amount of redundant information.
[0185] Fig. 3 shows an example of a linear array of cameras. Again, the cameras provide information from different parts of the scene; for example, c1 is the only camera capturing region 2, c3 is the only camera capturing region 4, and c4 is the only camera capturing region 3. At the same time, some parts of the scene are captured by more than one of the cameras. For example, all cameras capture the front of foreground objects fg1 and fg2, but some cameras provide better capture than others. Fig. 3 shows an example A for four cameras and an example B for two cameras. As can be seen, the four-camera setup provides better capture, including capturing part of the scene (region 4 of the background bg), but of course it also generates more data, including more redundant data.
[0187] A disadvantage of capturing multiple views over a single central view is obviously the larger amount of image data. Another disadvantage is the large number of pixels generated, that is, the pixel rate that must be processed and that the decoder must produce. This also requires higher complexity and resource usage for view synthesis during playback.
[0189] In the following, a specific approach that uses a more efficient and less redundant image representation of the captured views will be described. It is intended to preserve some spatial and temporal coherence of the image data, allowing video encoders to be more efficient. It reduces the bit rate, the pixel rate, and the complexity of view synthesis at the playback side.
[0190] This representation comprises a plurality of combined images, each of which is generated from two or more of the source images (which may specifically be captured 3D images, e.g. represented as image plus depth map), and normally only a part of each of the source images is considered. The combined images can provide a reference for view synthesis and provide substantial scene information. The combined images can be generated to be biased towards more external views of the scene, and specifically towards the edges of the capture region. In some embodiments, one or more central combined images may also be provided.
[0192] In many embodiments, each of the combined images represents views from different viewing positions, i.e., each image may comprise at least pixels corresponding to different viewing/capture/anchor postures. Specifically, each pixel in a combined image can represent a path posture corresponding to an origin/position and a direction/orientation for a path from that origin/position directed in that direction/orientation and ending at the scene/object point which is represented by the pixel value for that pixel. At least two pixels of a combined image may have different path origins/positions. For example, in some embodiments, the pixels of a combined image may be divided into N groups, where all pixels in a group have the same path origin/position, but the path origin/position is different for the individual groups. N can be two or more. In some embodiments, N may be equal to the maximum number of horizontal pixels in a row (and/or the number of columns in the combined image), and indeed in some embodiments, N may be equal to the number of pixels, i.e. each pixel can have an individual path origin/position.
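A small sketch of one such arrangement is given below (Python; the choice of one path origin per pixel column, interpolated along a straight baseline, is merely one illustrative possibility and is not prescribed by the embodiments).

    import numpy as np

    def column_path_origins(width, start, end):
        """One path origin per pixel column of a combined image, obtained by
        linear interpolation along a baseline from 'start' to 'end'.

        Returns an array of shape (width, 3); every pixel in column i shares
        the path origin origins[i].
        """
        start, end = np.asarray(start, float), np.asarray(end, float)
        t = np.linspace(0.0, 1.0, width)[:, None]       # interpolation factor per column
        return (1.0 - t) * start + t * end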
[0194] A path posture for a pixel may represent an origin/position, and/or an orientation/direction for a path between the origin/position and the point in the scene represented by the pixel. The origin/position may specifically be a viewing position for the pixel and the orientation/direction may be the viewing direction for the pixel. It can effectively represent the light path that would be captured at the path position from the path direction for the pixel, and thus reflects the light path that is represented by the pixel value.
[0196] Thus, each pixel can represent the scene viewed from one viewing position in one viewing direction. The viewing position and viewing direction accordingly define a path. Each pixel may have an associated viewing path from the viewing position for the pixel and in the viewing direction for the pixel. Each pixel represents the scene for a path (view) posture, which is the path posture from a view point/position for the pixel and in a view direction. The pixel may specifically represent the point in the scene (scene point) where the view path intersects an object in the scene (including the background). A pixel can represent light paths from a point in the scene to the viewing position and in the viewing direction. The view path may be a path from the view position in the direction intersecting the scene point.
[0198] In addition, the combined images are supplemented by segments or fragments of the captured views that have been identified as not being predicted well enough by the combined images. Therefore, a usually large number of, usually small, segments are defined and included to specifically represent individual portions of the captured images that may provide information about elements of the scene not sufficiently well represented by the combined images.
[0200] An advantage of this representation is that different encodings can be provided for different parts of the image data to be transmitted. For example, complex and efficient encoding and compression can be applied to the combined images, since these tend to make up the majority of the image signal, while less efficient encoding can be applied to the segments. Furthermore, the combined images can be generated in a way that is well suited for efficient coding, for example, by generating them in a way that is similar to conventional images, which allows efficient image coding approaches to be used. In contrast, the properties of the segments may vary much more depending on the specific characteristics of the images, and thus may be more difficult to encode as efficiently. However, this is not a problem, as segments tend to provide much less image data.
[0202] Fig. 4 illustrates an example of an apparatus for generating an image signal including a representation of a plurality of source images of the scene from different source viewing postures (anchor postures) as described above. The apparatus will also be referred to as an image signal transmitter 400. The image signal transmitter 400 may be comprised, for example, in the VR server 103 of Fig. 1.
[0203] Fig. 5 illustrates an example of an apparatus for rendering view images based on a received image signal that includes a representation of a plurality of images of the scene. The apparatus may specifically receive the image data signal generated by the apparatus of Fig. 4 and process it to render images for specific viewing postures. The apparatus of Fig. 5 will also be referred to as image signal receiver 500. Image signal receiver 500 may, for example, be included in client device 101 of Fig. 1.
[0205] The image signal transmitter 400 comprises a source image receiver 401 that is arranged to receive a plurality of source images of the scene. The source images can represent views of the scene from different viewing positions. The source images may typically be captured images, e.g. captured by the cameras of a camera rig. The source images may comprise, for example, images from a row of equidistant capture cameras or from a ring of cameras.
[0207] In many embodiments, the source images may be 3D images comprising 2D images with associated depth information. The 2D images can specifically be view images of the scene from the corresponding capture posture, and the 2D image can be accompanied by an image or a depth map comprising depth values for each of the pixels of the 2D image. The 2D image can be a texture map. The 2D image may be a light intensity image.
[0209] The depth values may be, for example, disparity values or distance values, for example, indicated by a z-coordinate. In some embodiments, a source image may be a 3D image in the form of a texture map with an associated 3D mesh. In some embodiments, such texture maps and mesh representations may be converted to image representations plus depth by the source image receiver prior to further processing by the image signal transmitter 400.
[0211] The source image receiver 401 therefore receives a plurality of source images that characterize and represent the scene from different source viewing postures. Such a set of source images allows view images to be generated for other postures using algorithms such as view shifting, as is known to the skilled person. Accordingly, the image signal transmitter 400 is arranged to generate an image signal comprising image data for the source images and transmit this data to a remote device for local rendering. However, direct transmission of all source images would require an infeasibly high data rate and would comprise a large amount of redundant information. The image signal transmitter 400 is arranged to reduce the data rate by using an image representation as described above.
[0213] Specifically, the source image receiver 401 is coupled to a combined image generator 403 that is arranged to generate a plurality of combined images. The combined images comprise information derived from a plurality of source images. The exact approach for deriving the combined images may differ between different embodiments, and specific examples will be described in more detail below. In some embodiments, a combined image may be generated by selecting pixels from different source images. In other embodiments, the combined image generator may alternatively or additionally generate one or more of the combined images by view synthesis from the source images.
[0215] However, while each combined image includes a contribution from at least two, and often more, of the source images, typically only a portion of the individual source images is considered for each combined image. Thus, for each source image used to generate a given merged image, there are some pixels that are excluded/discarded. Therefore, the pixel values generated for the specific combined image do not depend on the values of these pixels.
[0217] The combined images may be generated such that each image does not simply represent one view/capture/anchor position, but rather represents two or more views/captures/anchor positions. Specifically, the path origin/position of at least some pixels in a combined image will be different, and therefore a combined image may represent a view of the scene from different directions.
[0219] The combined image generator 403 may therefore be arranged to generate a plurality of combined images from the source images, where each combined image is derived from a set of at least two source images, and where typically the derivation of a first combined image includes only a part of each of these at least two source images. Furthermore, each pixel of a given combined image represents the scene for a travel posture, and the travel postures for each combined image may comprise at least two different positions.
[0221] The combined image generator 403 is coupled to an evaluator 405 which receives the combined images and the source images. Evaluator 405 is arranged to determine prediction quality measures for the elements of the source images. An element may be an individual pixel, and evaluator 405 may be arranged to determine a prediction quality measure for each pixel of each source image. In other embodiments, the elements may comprise a plurality of pixels, and each element may be a group of pixels. For example, a prediction quality measure may be determined for blocks of, for example, 4x4 or 16x16 pixels. This can reduce the granularity of the segments or fragments that are determined, but can substantially reduce processing complexity and resource usage.
[0223] The prediction quality measure for a given element is generated to be indicative of a difference between the pixel values in the first source image for the pixels in the element and the predicted pixel values for the pixels in the element. Thus, an element may consist of one or more pixels, and the prediction quality measure for the element may be indicative of the difference between the pixel values for those pixels in the original source image and the pixel values that would result from a prediction from the combined images.
[0225] It will be appreciated that different approaches may be used to determine prediction quality measures in different embodiments. Specifically, in many embodiments, evaluator 405 may proceed to make a prediction of each of the source images from the combined images. Next, for each individual image and each individual pixel, the difference between the original pixel value and the predicted pixel value can be determined. It will be appreciated that any suitable difference measure may be used, such as a simple absolute difference, a root-sum-squared difference applied to the pixel value components of, e.g., multiple color channels, and the like.
[0227] Such a prediction can emulate the prediction/view synthesis that image signal receiver 500 can perform to generate views for the viewing postures of the source images. The prediction quality measures therefore reflect how well a receiver of the combined images may be able to regenerate the original source images based solely on the combined images.
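A minimal sketch of such a comparison is shown below (Python; a per-block sum of squared differences is used here purely as an example of a suitable difference measure).

    import numpy as np

    def block_prediction_difference(source, predicted, block=16):
        """Per-block sum of squared differences between a source image and its
        prediction from the combined images.

        source, predicted: (H, W, C) arrays with H and W multiples of 'block'.
        Returns an (H // block, W // block) array; larger values indicate a
        worse prediction (lower prediction quality) for that element.
        """
        diff = (source.astype(float) - predicted.astype(float)) ** 2
        hb, wb = source.shape[0] // block, source.shape[1] // block
        return diff.reshape(hb, block, wb, block, -1).sum(axis=(1, 3, 4))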
[0229] A predicted image for a source image from the combined images may be an image for the viewing posture of the source image generated by view synthesis from the combined images. View synthesis typically includes a viewing posture change, and typically a viewing position change. The view synthesis may be a view shifting image synthesis.
[0231] A prediction of a first image from a second image may specifically be a view synthesis of an image at the viewing posture of the first image based on the second image (and the viewing posture of the second image). Thus, a prediction operation to predict a first image from a second image may be a shift of the second image from the viewing posture associated with it to the viewing posture of the first image.
[0233] It will be appreciated that different methods and algorithms may be used for view synthesis and prediction in different embodiments. In many embodiments, a view synthesis and prediction algorithm may be used that takes as input a synthesis viewing posture for which the synthesized image is to be generated, and a plurality of input images, each of which is associated with a different viewing posture. The view synthesis algorithm can then generate the synthesized image for this viewing posture based on the input images, which can typically include both a texture map and depth.
[0235] Various such algorithms are known, and any suitable algorithm may be used without departing from the invention. As an example of this approach, intermediate synthesis and prediction images can first be generated for each input image. This can be achieved, for example, by first generating a mesh for the input image based on the depth map of the image. The mesh can then be warped/shifted from the input image viewing posture to the synthesis viewing posture based on geometric calculations. The vertices of the resulting mesh can be projected onto the intermediate synthesis/prediction image and the texture map can be overlaid on this image. This process can be implemented, for example, using the known vertex processing and fragment shaders of, for example, standard graphics pipelines.
[0236] In this way, for each of the input images, an intermediate synthesis/prediction image (hereinafter simply intermediate prediction image) can be generated for the synthesis viewing posture.
[0238] The intermediate prediction images can then be combined with each other, for example, by a weighted combination/addition or by selection combining. For example, in some embodiments, each pixel of the synthesis/prediction image for the synthesis viewing posture may be generated by selecting the pixel of the intermediate prediction image that is furthest forward, or the pixel may be generated by a weighted sum of the value of the corresponding pixel from all intermediate prediction images, where the weight for a given intermediate prediction image depends on the depth determined for that pixel. The combining operation is also known as a blending operation.
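The following sketch illustrates the depth-weighted variant (Python; weighting by inverse depth is only one plausible choice, the embodiments merely require the weight to depend on the depth, or the front-most pixel to be selected).

    import numpy as np

    def blend_intermediate_predictions(colors, depths, eps=1e-6):
        """Blend per-input intermediate prediction images into a single prediction
        for the synthesis viewing posture, giving nearer pixels larger weights.

        colors: (N, H, W, C) intermediate prediction images.
        depths: (N, H, W) depth per pixel of each intermediate prediction image.
        """
        weights = 1.0 / (depths + eps)                    # nearer (smaller depth) -> larger weight
        weights = weights / weights.sum(axis=0, keepdims=True)
        return (colors * weights[..., None]).sum(axis=0)  # (H, W, C) blended prediction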
[0240] In some embodiments, prediction quality measures may be determined without performing a full prediction; instead, an indirect measure of prediction quality may be used.
[0242] The prediction quality measure can, for example, be determined indirectly by evaluating a process parameter involved in the view shift, for example the amount of geometric distortion (stretching) that occurs in a primitive (usually a triangle) when performing the viewing posture change. The greater the geometric distortion, the lower the prediction quality measure for any pixel represented by this primitive.
[0244] Evaluator 405 may thus determine prediction quality measures for elements of the plurality of source images, where a prediction quality measure for an element of a first source image is indicative of a difference between the predicted pixel values for the pixels in the element predicted from the plurality of combined images and the pixel values in the first source image for the pixels in the element.
[0246] The evaluator 405 is coupled to a determiner 407 that is arranged to determine segments of the source images that comprise elements for which the prediction quality measure is indicative of the difference being above a threshold / the prediction quality measure is indicative of the prediction quality being below a threshold.
[0248] The segments may correspond to the individual elements determined by evaluator 405 for which the prediction quality measure is below a quality threshold. However, in many embodiments, the determiner 407 may be arranged to generate segments by grouping such elements, and in fact the grouping may also include some elements for which the prediction quality measure is above the threshold.
[0250] For example, in some embodiments, determiner 407 may be arranged to generate segments by grouping all adjacent elements that have a prediction quality measure below a quality threshold (hereinafter referred to as low prediction quality measures and low quality elements, respectively).
[0252] In other embodiments, the determiner 407 may be arranged, for example, to fit segments of a given size and shape to the images such that they include as many low-quality elements as possible.
[0254] Consequently, the determiner 407 generates a set of segments that include the low quality elements, i.e. the elements that cannot be predicted with sufficient accuracy from the combined images. Typically, the segments will correspond to a low proportion of the source images and thus to a relatively small amount of pixel and image data.
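A possible low-complexity grouping is sketched below (Python; it reuses the per-block difference map of the earlier sketch and SciPy's connected-component labelling as a stand-in for whatever grouping the determiner 407 actually applies).

    import numpy as np
    from scipy import ndimage   # assumes SciPy is available

    def low_quality_segments(block_diff, threshold, block=16):
        """Group adjacent low quality blocks into rectangular segments.

        block_diff: (Hb, Wb) per-block difference measures.
        Returns a list of (row0, col0, row1, col1) rectangles in pixel coordinates,
        one bounding box per 8-connected cluster of blocks above the threshold.
        """
        mask = block_diff > threshold                       # low quality elements
        labels, _ = ndimage.label(mask, structure=np.ones((3, 3)))
        segments = []
        for rows, cols in ndimage.find_objects(labels):
            segments.append((rows.start * block, cols.start * block,
                             rows.stop * block, cols.stop * block))
        return segments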
[0256] The determiner 407 and combined image generator 403 are coupled to an image signal generator 409 which receives the combined images and segments. Image signal generator 409 is arranged to generate an image signal comprising image data representing the combined images and image data representing the segments.
[0258] The image signal generator 409 may specifically encode the combined images and segments and may specifically do so differently and use different coding algorithms and standards for the combined images and segments.
[0259] Typically, the combined images are encoded using highly efficient image coding algorithms and standards, or highly efficient video coding algorithms and standards if the images are frames of a video signal.
[0261] The encoding of the segments can normally be less efficient. For example, the segments may be combined into segment images where each image may comprise segments from a plurality of source images. These combined segment images may be encoded using a standard image or video encoding algorithm. However, due to the mixed and partial nature of such combined segment images, encoding is typically less efficient than for normal full images.
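Purely as an illustration (a hypothetical shelf-packing scheme; the atlas size and the placement metadata fields are assumptions, not the packing of the embodiments), segments could be packed into such a combined segment image as follows:

    import numpy as np

    def pack_segments(segments, atlas_shape=(1024, 1024, 3)):
        # segments: list of (source_view_id, patch) where patch is an h x w x 3 array
        # no wider than the atlas. Returns the atlas and per-segment placement metadata.
        atlas = np.zeros(atlas_shape, dtype=np.uint8)
        placements, x, y, shelf_h = [], 0, 0, 0
        for view_id, patch in segments:
            h, w = patch.shape[:2]
            if x + w > atlas_shape[1]:            # start a new shelf of segments
                x, y, shelf_h = 0, y + shelf_h, 0
            if y + h > atlas_shape[0]:
                raise ValueError("atlas is full")
            atlas[y:y + h, x:x + w] = patch
            placements.append({"view": view_id, "x": x, "y": y, "w": w, "h": h})
            x, shelf_h = x + w, max(shelf_h, h)
        return atlas, placements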
[0263] As another example, due to the sparse nature of segments, they may not be stored in full frames/images. In some embodiments, the segments may be represented, for example, as meshes in 3D space using VRML (Virtual Reality Modeling Language).
[0265] The image data of the segments may typically be accompanied by metadata indicative of the origin of the segments, such as the coordinates in the original image and the originating camera/source image.
[0267] The image signal is in this example transmitted to the image signal receiver 500, which is part of the VR client device 101. The image signal receiver 500 comprises an image signal receiver 501 which receives the image signal from the image signal transmitter 400. Image signal receiver 501 is arranged to decode the received image signal to recover the combined images and segments.
[0269] The image signal receiver 501 is coupled to an image processor 503 which is arranged to process the image signal, and specifically the combined images and segments.
[0271] In many embodiments, image processor 503 may be arranged to synthesize viewing images for different viewing postures based on the combined images and segments.
[0272] In some embodiments, image processor 503 may proceed to first synthesize source images. The parts of the synthesized source images for which a segment is included in the image signal may then be replaced by the image data of the provided segments. The resulting source images can then be used for conventional image synthesis.
[0274] In other embodiments, the combined images and segments can be used directly without first retrieving the source images.
[0276] It will be appreciated that image signal transmitter 400 and image signal receiver 500 comprise the functionality necessary to communicate the image signal, including functionality to encode, modulate, transmit, receive, etc. the image signal. It will be appreciated that such functionality will depend on the preferences and requirements of the individual implementation and that such techniques will be known to those skilled in the art and will therefore, for the sake of clarity and brevity, not be discussed further in this document.
[0278] Different approaches can be used to generate the combined images in different embodiments.
[0280] In some embodiments, the combined image generator 403 may be arranged to generate the combined images by selecting pixels from the source images. For example, for each pixel of a combined image, the combined image generator 403 may select a pixel from one of the source images.
[0282] An image and/or depth map comprises pixels having values that can be considered to represent the corresponding image property (intensity/light intensity or depth) of the scene along a path having a direction of travel (orientation) from a travel origin (position). The origin of the path is normally the viewing posture of the image, but in some representations it may vary on a pixel basis (as, for example, in the case of omnidirectional stereoscopy, where the image itself can be considered to have a viewing posture corresponding to the center of the omnidirectional stereoscopic circle, but each pixel has an individual viewing posture corresponding to the position on the omnidirectional stereoscopic circle). The direction of travel can vary from pixel to pixel, especially in the case of images where all pixels have the same origin of travel (i.e., there is a single common viewing posture of the image). The origin and/or travel direction is also often referred to as travel posture or travel projection posture.
[0284] In this way, each pixel is linked to a position that is the origin of a path/straight line. Also, each pixel is linked to a direction which is the direction of the path/straight line from the origin. Therefore, each pixel is linked to a path/straight line that is defined by a position/origin and a direction from this position/origin. The pixel's value is given by the appropriate property of the scene at the first intersection of the path for the pixel with an object in the scene (including a background). Thus, the pixel value represents a property of the scene along a path/straight line that originates from a path origin position and has a path direction associated with the pixel. In other words, the pixel value represents a property of the scene along a path having the path posture of the pixel.
[0286] Therefore, for a given first pixel in the combined image being generated, the combined image generator 403 can determine the corresponding pixels in the source images as pixels representing the same direction of travel. Corresponding pixels may be pixels that represent the same direction of travel but may have different positions since the source images may correspond to different positions.
[0288] Thus, in principle, for a given pixel in the combined image, the combined image generator 403 can determine a direction of travel, and then determine all pixels in the source images that have the same (within a given similarity requirement) travel direction and consider them as corresponding pixels. Thus, corresponding pixels will typically have the same direction of travel but different positions/origins of travel.
[0290] The views for the different source viewing postures can, for example, be resampled such that corresponding image coordinates have corresponding travel directions. For example, when source views are rendered in a partial equirectangular projection format, they are resampled to a full 360°/180° version. For example, a view sphere may be defined that surrounds the entire source view configuration. This view sphere can be divided into pixels, and each pixel has a direction of travel. For a given source image, each pixel can be resampled to the view sphere representation by setting the pixel value of the view sphere for a given direction of travel to the value of the pixel in the source view that has the same direction of travel.
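A minimal sketch of such a resampling (assuming, for illustration only, a pinhole source camera, a known world-to-camera rotation and an equirectangular view sphere representation; all names and the camera model are assumptions) could be:

    import numpy as np

    def equirect_directions(height, width):
        # Unit travel direction per equirectangular pixel (longitude/latitude grid).
        lon = (np.arange(width) + 0.5) / width * 2.0 * np.pi - np.pi
        lat = np.pi / 2.0 - (np.arange(height) + 0.5) / height * np.pi
        lon, lat = np.meshgrid(lon, lat)
        return np.stack([np.cos(lat) * np.sin(lon),
                         np.sin(lat),
                         np.cos(lat) * np.cos(lon)], axis=-1)        # H x W x 3

    def resample_to_sphere(src_img, rotation, focal, height, width):
        # rotation: 3x3 world-to-camera rotation of the source view,
        # focal: focal length of the assumed pinhole source camera in pixels.
        dirs = equirect_directions(height, width) @ rotation.T       # directions in camera frame
        out = np.zeros((height, width, src_img.shape[2]), dtype=src_img.dtype)
        valid = dirs[..., 2] > 0                                     # in front of the camera
        z = np.where(valid, dirs[..., 2], 1.0)
        u = focal * dirs[..., 0] / z + src_img.shape[1] / 2.0
        v = focal * dirs[..., 1] / z + src_img.shape[0] / 2.0
        inside = valid & (u >= 0) & (u < src_img.shape[1]) & (v >= 0) & (v < src_img.shape[0])
        out[inside] = src_img[v[inside].astype(int), u[inside].astype(int)]
        return out   # partially filled: pixels outside the source viewport remain empty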
[0292] Resampling the source images into a full viewsphere surface representation typically results in N partially filled images, since individual images typically have limited viewports and N is the number of source images. However, the viewports tend to overlap, and consequently the set of viewsphere surface representations tends to provide multiple pixel values for any given direction.
[0294] Combined image generator 403 can now proceed to generate at least one, but typically a plurality of combined images by selecting between corresponding pixels.
[0296] Specifically, a first combined image can be generated to cover a part of the scene. For example, a combined image can be generated that has a predetermined size to cover a certain area of pixels in the viewsphere representations, thus describing this section of the scene. In some embodiments, each of the combined images may cover the entire scene and include the entire area of the viewsphere.
[0298] For each pixel in the first combined image, the combined image generator 403 can now consider the corresponding pixels in the viewsphere representations and proceed to select one of the pixels. The combined image generator 403 can specifically generate the first combined image by selecting the pixel value for the combined image as the pixel value for the corresponding pixel in the view source image for which the corresponding pixel represents a path having the largest distance from the center point in a first direction along a first axis perpendicular to a direction of travel for the corresponding pixel.
[0300] The distance from the center point to a path can be determined as the distance between the center point and the path of the corresponding pixel for that pixel of the combined image.
[0302] The selection can be exemplified by Fig. 6, which is based on the example of a circular source viewing posture configuration having a center point C.
[0303] In this example, the determination of a pixel of a combined image having a travel direction rc is considered. Source cameras/views 1-4 capture this direction, and therefore there are four corresponding pixels. Each of these corresponding pixels represents a different pose, and they thus represent paths originating from different positions, as shown. Consequently, there is an offset distance p1-p4 between the paths and the combined image path rc, corresponding to the distance between the center point C and the paths when they are extended backwards (to cross axis 601).
[0305] Fig. 6 also shows a direction/axis 601 perpendicular to the path rc. For a first combined image, the combined image generator 403 can now select the corresponding pixel for which the travel distance in this direction is the greatest. Therefore, in this case, the pixel value of the combined image will be selected as the pixel value for camera/view 1, since p1 is the largest distance in this direction.
[0307] The combined image generator 403 can normally proceed to determine a second combined image by performing the same operation but selecting the corresponding pixels that have the largest distance in the opposite direction (it could be considered that the generation of the first and second combined images can be by selection of the largest positive and negative distance respectively with respect to the first direction if the distance is measured as positive when it is in the same direction as the axis and negative when it is in the other direction). Therefore, in this case, the combined image generator 403 will select the pixel value of the combined image as the pixel value of camera/view 4, since p4 is the largest distance in this direction.
[0309] In many embodiments, the combined image generator 403 may further proceed to generate a third combined image by performing the same operation but selecting the corresponding pixels that have the smallest distance in either direction (the smallest absolute distance). Therefore, in this case, the combined image generator 403 will select the pixel value of the combined image as the pixel value of camera/view 3, since p3 is the smallest distance.
[0311] In this way, the combined image generator 403 can generate three combined images for the same part of the scene (and possibly for the entire scene). One of the images will correspond to a selection of pixels that provides the most lateral view of the scene from one direction, another will represent the most lateral view of the scene from the opposite direction, and another will represent the most central view of the scene. This can be illustrated with Fig. 7, which shows the selected viewing directions of each view/camera for, respectively, the center combined image and the two side combined images.
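The three selection rules can be sketched as follows (illustrative only; a planar, two-dimensional geometry and hypothetical function names are assumed):

    import numpy as np

    def signed_offsets(center, cam_positions, direction):
        # Signed perpendicular offset of each source camera position from the center
        # point, measured along an axis perpendicular to the (in-plane) travel direction.
        d = np.asarray(direction, dtype=float)
        d = d / np.linalg.norm(d)
        axis = np.array([-d[1], d[0]])                    # 90 degrees to the travel direction
        rel = np.asarray(cam_positions, dtype=float) - np.asarray(center, dtype=float)
        return rel @ axis                                 # one signed distance per camera

    def select_views(center, cam_positions, direction):
        p = signed_offsets(center, cam_positions, direction)
        return {"lateral_pos": int(np.argmax(p)),         # first combined image
                "lateral_neg": int(np.argmin(p)),         # second combined image
                "central": int(np.argmin(np.abs(p)))}     # third combined image

    # Example: four cameras on a circle and a travel direction along +y.
    print(select_views((0, 0), [(-1, 0), (-0.4, 0.9), (0.4, 0.9), (1, 0)], (0, 1)))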
[0313] The resulting images thus provide a very efficient representation of the scene, with one combined image typically providing the best representation of foreground objects and the other two combining to provide background-centric data.
[0315] In some embodiments, the combined image generator 403 may be arranged to further generate one or more combined images by selecting corresponding pixels along an axis direction that is perpendicular to the direction of travel but is different from the previously used axis direction. This approach may be suitable for non-planar source viewing configurations (i.e., three-dimensional configurations). For example, for a spherical source viewing posture configuration, more than two planes may be considered, for example planes at 0°, 60° and 120°, or two orthogonal planes (for example, left-right and up-down planes).
[0317] In some embodiments, the combined images may be generated by vision synthesis/prediction from the source images. Image generator 103 may specifically generate combined images representing views of the scene from different viewing positions, and specifically from viewing positions other than those of the source images. Also, unlike conventional image synthesis, a combined image is not generated to represent the view of the scene from a single viewing/capturing position, but can represent the scene from different viewing positions even within the same combined image. Thus, a combined image can be generated by generating pixel values for the pixels of the combined image by vision synthesis/prediction from the source images, but with the pixel values representing different viewing positions.
[0319] Specifically, for a given pixel in the combined image, vision synthesis/prediction can be performed to determine the pixel value corresponding to the specific path posture for that pixel. This can be repeated for all pixels in the combined image, but with at least some of the pixels having traversal poses with different positions.
[0321] For example, a single combined image may provide a 360° representation of the scene corresponding to, for example, a viewsphere surface surrounding the entire source view configuration. However, views of different parts of the scene can be rendered from different positions within the same combined image. Fig. 8 illustrates an example where the combined image comprises pixels representing two different path positions (and thus pixel viewing positions), namely a first path origin 801 which is used for the pixels representing one hemisphere and a second path origin 803 which is used for the pixels representing the other hemisphere. For each of these travel origins/positions, the pixels are provided with different travel directions as shown by the arrows. In the specific example, the source view configuration comprises eight source views (1-8) in a circular arrangement. Each camera view only provides a partial view, for example a 90° view, but with an overlap between the views. For a given pixel in the combined image, there may be an associated path pose, and the pixel value for this path pose can be determined by vision synthesis/prediction from the source views.
[0323] In principle, each pixel of the combined image can be individually synthesized, but in many embodiments combined synthesis is performed for a plurality of pixels. For example, a single 180° image for the first position 801 can be synthesized from the source view images (for example, using positions 2, 1, 8, 7, 6, 5, 4) and a single 180° image for the second position 803 from the source view images (for example, using positions 6, 5, 4, 3, 2, 1, 8). The combined image can then be generated by combining these images. If the separately synthesized images overlap, blending or mixing can be used to generate the combined image. Alternatively, the overlapping parts of the combined images can be masked out, for example by assigning a reserved color or depth value. This increases the efficiency of the video encoding.
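A minimal sketch of such a combination step (hypothetical; the per-image validity masks and the reserved colour value are assumptions) is given below:

    import numpy as np

    RESERVED = np.array([0, 255, 0], dtype=np.uint8)     # hypothetical reserved colour value

    def combine_views(img_a, valid_a, img_b, valid_b, blend=True):
        # img_a / img_b: H x W x 3 images synthesized for the two travel origins;
        # valid_a / valid_b: H x W boolean masks marking the synthesized pixels.
        out = np.zeros_like(img_a)
        only_a = valid_a & ~valid_b
        only_b = valid_b & ~valid_a
        both = valid_a & valid_b
        out[only_a] = img_a[only_a]
        out[only_b] = img_b[only_b]
        if blend:
            # Blend the overlap by simple averaging.
            out[both] = ((img_a[both].astype(np.uint16) + img_b[both]) // 2).astype(np.uint8)
        else:
            # Alternatively, mask the overlap with the reserved colour value.
            out[both] = RESERVED
        return out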
[0325] In many embodiments, one or more of the combined images may be generated to represent the scene from a point of view that provides a more lateral view of the scene. For example, in Fig. 8, the center of the view circle corresponds to the center point of the source viewing postures and to the center of the travel origin positions for the combined image. However, the travel directions for a given travel origin 801, 803 are not in a predominantly radial direction, but instead provide a side view of the scene. Specifically, in the example, both the first travel origin 801 and the second travel origin 803 provide left-direction views, that is, the travel directions for both are to the left when viewing the travel origin 801, 803 from the center point.
[0327] The image generator 103 can proceed to generate a second combined image that represents a different view of the scene, and specifically can often advantageously generate a second view of the scene that is complementary to the first view but looks in the opposite direction. For example, image generator 103 may generate a second combined image that uses the same travel origins but in which the travel directions are in the opposite direction. For example, the image generator 103 can generate a second combined image corresponding to the configuration of Fig. 9.
[0329] The two images can provide a very advantageous and complementary representation of the scene, and can typically provide an improved representation of the background portions of the scene.
[0331] In many embodiments, the combined images may also include one or more images that are generated to provide a more frontal view, such as one corresponding to the configuration of Fig. 10. This example may in many embodiments provide an enhanced representation of the front of foreground objects.
[0333] It will be appreciated that different travel origin configurations may be used in different embodiments, and specifically more origins may be used. For example, Figs. 11 and 12 show examples of two complementary configurations for generating side-view combined images where the path origins are distributed on a curve (specifically a circle) that in this case surrounds the source view configuration (often such a curve would be selected to match the source view posture configuration). The figures only show the origins and poses for a portion of the circle/curve, and it will be appreciated that in many embodiments a full spherical or 360° view will be generated.
[0334] Fig. 7 can be considered as illustrating another exemplary configuration in which three combined images are generated based on eight travel positions in a circle around a center point. For the first combined image, travel directions in a radial direction around the circle are selected, for the second image travel directions at an angle of 90° to the right are selected, and for the third image travel directions at an angle of 90° to the left are selected. This combination of combined images can provide a highly efficient combined representation of a scene.
[0336] In some embodiments, the image generator 103 may be arranged to generate pixel values for the combined images for specific walk postures by vision synthesis of the source images. Walking poses can be selected differently for different combined images.
[0338] Specifically, in many embodiments, the walk poses for one image may be selected to provide a side view of the scene from the path origin, and the walk poses of another image may be selected to provide a complementary side view.
[0340] Specifically, for a first combined image, the traversal poses may be such that a dot product between a vertical vector and the pixel cross product vectors is non-negative for at least 90% (sometimes 95% or even all) of the pixels of the first combined image. The pixel cross product vector for a pixel is determined as a cross product between a travel direction for the pixel and a vector from a center point for the different source viewing postures to a travel position for the pixel.
[0342] The center point for the source vision postures may be generated as a mean or average position for the source vision postures. For example, each coordinate (eg, x, y, z) can be averaged individually and the resulting average coordinate can be the center point. Note that the center point of a configuration is not (necessarily) at the center of the smallest circle/sphere that comprises the source vision poses.
[0344] The vector from the center point to the path origin for a given pixel is thus a vector in scene space that defines a distance and a direction from the center point to the viewing position for that pixel. The direction of the path can be represented by any vector that has the same direction, that is, it can be a vector from the origin of the path to the point of the scene represented by the pixel (and therefore it can also be a vector in the scene space).
[0346] The cross product between these two vectors will be perpendicular to both. For a horizontal plane (in the scene coordinate system), a counterclockwise travel direction (viewed from the center point) will result in a cross product vector that has an upward component, that is, that has a positive z-component in a scene coordinate system x,y,z where z indicates the height. The cross product vector will be up for any view to the left, regardless of trace origin, for example, it will be up for all pixels/positions of the trace in Fig. 8.
[0348] In contrast, for a view to the right, the cross product vector will be downward for all traversing positions, eg, for all pixels/traversing positions in Fig. 9 a negative z-coordinate will result.
[0350] The dot product between a vertical vector in the scene space and all vectors that have a positive z-coordinate will have the same sign, namely positive for a vertical vector pointing up and negative for a vertical vector pointing down. Conversely, for a negative z-coordinate, the dot product will be negative for a vertical vector pointing up and positive for a vertical vector pointing down. Consequently, the dot product will have the same sign for all traversal postures looking to the right and the opposite sign for all traversal postures looking to the left.
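The sign test can be sketched as follows (illustrative only; the choice of the vertical vector and the coordinate convention are assumptions):

    import numpy as np

    UP = np.array([0.0, 0.0, 1.0])      # vertical vector; z indicates height in scene space

    def lateral_sign(center, path_origin, path_direction):
        # Cross product between the travel direction and the vector from the
        # configuration center point to the travel origin of the pixel, followed by
        # the sign of its dot product with the vertical vector.
        to_origin = np.asarray(path_origin, dtype=float) - np.asarray(center, dtype=float)
        cross = np.cross(np.asarray(path_direction, dtype=float), to_origin)
        return np.sign(np.dot(UP, cross))

    # Example: for origins on a circle with tangential (lateral) travel directions,
    # all pixels looking to the same side yield the same sign.
    print(lateral_sign((0, 0, 0), (1, 0, 0), (0, 1, 0)))
    print(lateral_sign((0, 0, 0), (0, 1, 0), (-1, 0, 0)))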
[0352] In some scenarios, a zero vector or a zero dot product may result (for example, for pole points on a view circle), and for such traversal poses the sign will not distinguish between left and right views.
[0354] It will be appreciated that the foregoing considerations also apply, mutatis mutandis, to a three-dimensional representation, such as, for example, when the path origins lie on a sphere.
[0355] Thus, in some embodiments, at least 90%, and in some embodiments at least 95% or even all of the pixels in a combined image result in a dot product that does not have different signs, that is, at least that number of pixels will have a side view to the same side.
[0357] In some embodiments, the combined images may be generated to have guard bands or, for example, some specific edge pixels may have particular circumstances for which the dot product may not meet the requirement. However, for the vast majority of pixels, the requirement is met, and the pixels provide corresponding side views.
[0359] Furthermore, in many embodiments, at least two combined images meet these requirements but with opposite dot product signs. Thus, for one merged image, at least 90% of the pixels may represent a view to the right, and for another merged image, at least 90% of the pixels may represent a view to the left.
[0361] Combined images can be generated for poses that provide a particularly advantageous view of the scene. The inventors have realized that, in many scenarios, it may be particularly advantageous to generate combined images for viewing postures that result in a more lateral view of the main part of the scene, and furthermore that for a given configuration of views source, it may be advantageous to generate at least some views that are near the extreme positions of the pattern rather than near the center of the pattern.
[0363] Thus, in many embodiments, at least one, and typically at least two, of the combined images are generated for walk poses that are close to the edge of a region corresponding to the source view configuration.
[0365] The region may specifically be a region of space (a collection or set of points in space) which is bounded by the largest polygon that can be formed using at least some of the source view positions as vertices for the straight line segments of the polygon. The polygon may be a plane figure bounded by a finite chain of straight line segments that loop to form a closed chain or circuit, and this may include a one-dimensional configuration such as Fig. 2A (also known as a degenerate polygon). For a three-dimensional configuration, the region can correspond to the largest possible polyhedron formed by at least some of the source view positions. Thus, the region can be the largest polygon or polyhedron that can be formed using at least some of the source view positions as vertices for the lines of the polygon or polyhedron.
[0367] Alternatively, a region comprising the different viewing positions of the plurality of source images may be a smaller line, circle, or sphere that includes all viewing positions. The region may specifically be a smaller sphere that includes all source view positions.
[0369] Thus, in many embodiments, the traversal poses of at least one of the combined images are selected to be close to the edge of the region comprising the source view pose pattern.
[0371] In many embodiments, at least one travel position of the combined images is determined to be less than a first distance from the edge of the region, where this first distance is no more than 50%, or in many cases no more than 25% or even 10%, of the maximum (interior) distance between points on the edge of the region. Thus, from the viewing position, a minimum edge distance may be no more than 50%, 25%, or 10% of a maximum edge distance.
[0373] This can be illustrated by Fig. 13, which shows an example of source viewpoints indicated by black dots. Fig. 13 further illustrates a region corresponding to the smallest sphere that includes the viewing postures. In the example, the viewing configuration is a planar, two-dimensional configuration, and the consideration of a sphere is reduced to the consideration of a circle 1301. Fig. 13 further shows a travel posture 1303 for a combined image that is near the edge of the sphere/circle/region. Specifically, the minimum distance dmin to the edge of the region is much smaller (about 10%) than the maximum distance dmax to the edge of the region.
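For a circular region, this criterion can be sketched as follows (illustrative only; the region center and radius are assumed to be known, for example from the smallest circle enclosing the source viewing postures):

    import numpy as np

    def near_edge(pose, region_center, region_radius, ratio=0.5):
        # dmin: shortest distance from the pose to the edge of the circular region;
        # dmax: longest distance from the pose to that edge.
        d = np.linalg.norm(np.asarray(pose, dtype=float) - np.asarray(region_center, dtype=float))
        dmin = region_radius - d
        dmax = region_radius + d
        return dmin <= ratio * dmax

    # Example: a pose at 90% of the radius satisfies dmin <= 0.5 * dmax, a central pose does not.
    print(near_edge((0.9, 0.0), (0.0, 0.0), 1.0))   # True
    print(near_edge((0.1, 0.0), (0.0, 0.0), 1.0))   # False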
[0375] In some embodiments, the walk postures of a combined image may be determined to be less than a first distance from the edge of the region, where the first distance is no more than 20%, or often even no more than 10% or 5%, of the maximum distance between two source viewing postures. In the example where the region is determined to be the smallest sphere/circle that includes all source view poses, the maximum distance between two view poses is equal to the diameter of the sphere/circle, and thus the viewing posture of the combined image can be selected such that the minimum distance dmin meets this requirement.
[0377] In some embodiments, the travel poses of a combined image may be determined to be at least a minimum distance from a center point of the different viewing poses, where the minimum distance is at least 50%, and often even 75% or 90%, of the distance from the center point to the edge along a line through the center point and the travel posture.
[0379] In some embodiments, two viewing postures are selected for the combined images such that a distance between them is at least 80%, and sometimes even 90% or 95%, of the distance between the two points at which a line through the two viewing postures crosses the edge. For example, if a line is drawn through the two poses, the distance between the two poses is at least 80%, 90%, or 95% of the distance between the points where the line crosses the circle.
[0381] In some embodiments, a maximum distance between two of the view poses of the combined first image is at least 80% of a maximum distance between points on the edge of a region comprising the different view poses of the plurality of source images.
[0383] The inventors have realized that the approach of generating combined images for positions close to the edge of the region comprising the original viewing positions may be particularly advantageous, since it tends to provide more information on background objects in the scene. Most background data is typically captured by cameras or image areas that have the greatest lateral distance from a central point of view. This can be advantageous in combination with a more central combined image, as this tends to provide better image information for foreground objects.
[0385] In many embodiments, image signal generator 409 may be arranged to further include metadata for the generated image data. Specifically, the combined image generator 403 may generate source data for the combined images, where the source data indicates which of the source images is the origin for individual pixels in the combined images. Image signal generator 409 may then include this data in the generated image signal.
[0386] In many embodiments, the image signal generator 409 may include source vision posture data indicative of the vision postures for the source images. The data may specifically include data defining the position and direction of each source image/view.
[0388] The image signal may include metadata indicating, possibly individually for each pixel, the position and direction for which the pixel values are provided, ie an indication of the path position. Consequently, the image signal receiver 500 may be arranged to process this data to perform, for example, vision synthesis.
[0390] For example, for each pixel of the three views generated by selecting the corresponding pixels, metadata indicating the identity of the source view may be included. This can result in three label maps, one for the center view and two for the side views. The labels can be linked to view-specific data, including, for example, camera optics and rig geometry.
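Purely as an illustration (hypothetical data structures, not the metadata syntax of the image signal), such per-pixel source labels and view-specific data could be organized as:

    from dataclasses import dataclass, field
    from typing import Dict, Tuple
    import numpy as np

    @dataclass
    class SourceViewInfo:
        position: Tuple[float, float, float]      # viewing position of the source camera
        orientation: Tuple[float, float, float]   # viewing direction of the source camera
        focal_px: float                           # hypothetical optics parameter

    @dataclass
    class CombinedImageMetadata:
        label_map: np.ndarray                     # H x W array; value = source view id per pixel
        views: Dict[int, SourceViewInfo] = field(default_factory=dict)

    meta = CombinedImageMetadata(
        label_map=np.zeros((1080, 1920), dtype=np.uint8),
        views={1: SourceViewInfo((0.0, 0.1, 1.6), (0.0, 1.0, 0.0), 1000.0)})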
[0392] It will be appreciated that the foregoing description has, for clarity, described embodiments of the invention with reference to different functional circuits, units and processors. However, it will be apparent that any suitable distribution of functionality between different functional circuits, units or processors can be used without departing from the invention. For example, functionality illustrated to be performed by separate processors or controllers may be performed by the same processor or controllers. Therefore, references to specific functional units or circuits should be considered only as references to appropriate means of providing the described functionality and not as an indication of a strict physical or logical structure or organization.
[0394] The invention may be implemented in any suitable way, including hardware, software, firmware, or any combination of these. The invention may optionally be implemented, at least in part, as computer software running on one or more data processors and/or digital signal processors. The elements and components of an embodiment of the invention may be physically, functionally, and logically implemented in any suitable way. In fact, the functionality may be implemented in a single unit, in a plurality of units or as part of other functional units. Thus, the invention may be implemented in a single unit or may be physically and functionally distributed among different units, circuits, and processors.
[0396] Although the present invention has been described in connection with some embodiments, it is not intended to limit it to the specific form set forth herein. Rather, the scope of the present invention is limited only by the appended claims. Furthermore, although a feature may appear to be described in connection with particular embodiments, one skilled in the art would recognize that various features of the described embodiments may be combined in accordance with the invention. In the claims, the term "comprising" does not exclude the presence of other elements or steps.
[0398] In addition, although listed individually, a plurality of means, elements, circuits, or method steps may be implemented, for example, by a single circuit, unit, or processor. Furthermore, although individual features may be included in different claims, these may be advantageously combined, and the inclusion in different claims does not imply that a combination of features is not feasible and/or advantageous. Likewise, the inclusion of a feature in one category of claims does not imply a limitation to this category, but rather indicates that the feature is equally applicable to other categories of claims as appropriate. Furthermore, the order of the features in the claims does not imply any specific order in which the features must be worked, and in particular the order of the individual steps in a method claim does not imply that the steps must be performed in this order. Rather, the steps can be performed in any suitable order. Furthermore, singular references do not exclude a plurality. Thus, references to "a", "an", "first", "second", etc. do not exclude a plurality. Reference signs in the claims are merely provided as a clarifying example and are not to be construed in any way as limiting the scope of the claims.
Claims (18)
[1]
1. An apparatus for generating an image signal, the apparatus comprising:
a receiver (401) for receiving a plurality of source images representing a scene from different viewing postures;
a combined image generator (403) for generating a plurality of combined images from the source images, each combined image being derived from a set of at least two source images of the plurality of source images, each pixel of a combined image representing the scene for a walk pose and the walk poses for each combined image including at least two different positions, a walk pose for a pixel representing a pose for a walk in a viewing direction for the pixel and from a viewing position for the pixel;
an evaluator (405) for determining prediction quality measures for elements of the plurality of source images, a prediction quality measure for an element of a first source image being indicative of a difference between pixel values in the first source image for the pixels in the element and the predicted pixel values for the pixels in the element, where the predicted pixel values are the pixel values resulting from the prediction of the pixels in the element of the plurality of combined images;
a determiner (407) for determining segments of the source images comprising elements for which the prediction quality measure is indicative of a difference above a threshold; and
an image signal generator (409) for generating an image signal comprising image data representing the combined images and image data representing the segments of the source images.
[2]
The apparatus of claim 1, wherein the combined image generator (403) is arranged to generate at least a first combined image of the plurality of combined images by vision synthesis of the pixels of the first combined image from the plurality of source images, where each pixel of the combined first image represents the scene for a walk pose and the walk poses for the first image comprise at least two different positions.
[3]
3. The apparatus of claim 2, wherein a dot product between a vertical vector and pixel cross product vectors is non-negative for at least 90% of the pixels in the first combined image, a pixel cross product vector for a pixel being a cross product between a travel direction for the pixel and a vector from a center point for the different viewing postures to a travel position for the pixel.
[4]
The apparatus of claim 3, wherein the combined image generator (403) is arranged to generate a second combined image of the plurality of combined images by vision synthesis of the pixels of the second combined image from the plurality of source images, wherein each pixel of the combined second image represents the scene for a walk pose and the walk poses for the second image comprise at least two different positions; and
wherein a dot product between the vertical vector and the pixel cross product vectors is non-positive for at least 90% of the pixels in the second combined image.
[5]
5. The apparatus of claim 2, wherein the path poses of the combined first image are selected to be close to an edge of a region comprising the different view poses of the plurality of source images.
[6]
6. The apparatus of claim 2 or 3, wherein each of the path postures of the combined first image is determined to be less than a first distance from an edge of a region comprising the different view postures of the plurality of source images, the first distance being no more than 50% of a maximum internal distance between edge points.
[7]
The apparatus of any preceding claim, wherein the combined image generator (403), for each pixel of a first combined image of the plurality of combined images, is arranged to:
determining a corresponding pixel in each of the vision source images for which a corresponding pixel is present, the corresponding pixel being one that represents a same direction of travel as the pixel in the first combined image;
select a pixel value for the pixel of the first combined image as a pixel value of the corresponding pixel in the source image of the view for which the corresponding pixel represents a path that has a greater distance from a center point for the different viewing poses, the greatest distance being in a first direction along a first axis perpendicular to a direction of travel for the corresponding pixel.
[8]
8. The apparatus of claim 7, wherein determining the corresponding pixels comprises resampling each source image to an image representation representing at least a portion of a surface of a viewing sphere surrounding the viewing postures and determining the corresponding pixels as pixels having a same position in the image representation.
[9]
9. The apparatus of claim 7 or 8, wherein the combined image generator (403), for each pixel of a second combined image, is arranged to:
select a pixel value for the pixel in the second combined image as a pixel value of the corresponding pixel in the source image of the view for which the corresponding pixel represents a path that has a greater distance from the center point in a direction opposite to the first direction.
[10]
10. The apparatus of any of claims 7-9, wherein the combined image generator (403), for each pixel of a third combined image, is arranged to:
selecting a pixel value for the pixel in the third combined image as a pixel value of the corresponding pixel in the vision source image for which the corresponding pixel represents a path having a smaller distance from the center point.
[11]
11. The apparatus of any of claims 7-10, wherein the combined image generator (403), for each pixel of a fourth combined image, is arranged to: select a pixel value for the pixel in the fourth combined image as a pixel value of the corresponding pixel in the source image of the view for which the corresponding pixel represents a path having a maximum distance from the center point in a second direction along a second axis perpendicular to a path direction for the corresponding pixel, the first axis and the second axis having different directions.
[12]
12. The apparatus of any of claims 7-11, wherein the combined image generator (403) is arranged to generate source data for the first combined image, the source data being indicative of which of the source images is an origin for each pixel of the first combined image; and the image signal generator (409) is arranged to include the source data in the image signal.
[13]
The apparatus of any of the preceding claims, wherein the image signal generator (409) is arranged to include source vision posture data in the image signal, the source vision posture data being indicative of the different vision postures for the source images.
[14]
14. An apparatus for receiving an image signal, the apparatus comprising:
a receiver (501) for receiving an image signal, the image signal comprising a plurality of combined images, each combined image representing image data derived from a set of at least two source images of a plurality of source images representing a scene from different viewing postures, each pixel of a combined image representing the scene for a walking posture and the walking postures for each combined image including at least two different positions, a walking posture for a pixel representing a posture for a walk in a viewing direction for the pixel and from a viewing position for the pixel;
image data for a set of segments of the plurality of source images, a segment for a first source image comprising at least one pixel of the first source image for which a prediction quality measure for a prediction of the segment of the plurality of combined images is below a threshold; and
a processor (503) for processing the image signal.
[15]
15. A method of generating an image signal, the method comprising:
receiving a plurality of source images representing a scene from different viewing postures;
generate a plurality of combined images from the source images, each combined image derived from a set of at least two source images of the plurality of source images, each pixel of a combined image representing the scene for a walk posture and the travel postures for each combined image including at least two different positions, a travel posture for a pixel representing a posture for a travel in a viewing direction for the pixel and from a viewing position for the pixel;
determining prediction quality measures for elements of the plurality of source images, a prediction quality measure for an element of a first source image being indicative of a difference between the pixel values in the first source image for the pixels in the element and the predicted pixel values for the pixels in the element, the predicted pixel values being the pixel values resulting from the prediction of the pixels in the element of the plurality of combined images;
determining segments of the source images comprising elements for which the prediction quality measure is indicative of a difference greater than a threshold; and generating an image signal comprising image data representing the combined images and image data representing the segments of the source images.
[16]
16. A method of processing an image signal, the method comprising:
receive an image signal, the image signal comprising:
a plurality of combined images, each combined image representing image data derived from a set of at least two source images of a plurality of source images representing a scene from different viewing postures, each pixel of a combined image representing the scene for a travel posture and the travel postures for each combined image including at least two different positions, a travel posture for a pixel representing a posture for a travel in a viewing direction for the pixel and from a viewing position for the pixel; image data for a set of segments of the plurality of source images, a segment for a first source image comprising at least one pixel of the first source image for which a prediction quality measure for a prediction of the segment of the plurality of combined images is below a threshold; and
process the image signal.
[17]
17. An image signal comprising
a plurality of combined images, each combined image representing image data derived from a set of at least two source images of a plurality of source images representing a scene from different viewing postures, each pixel of a combined image representing the scene for a travel posture and the travel postures for each combined image including at least two different positions, a travel posture for a pixel representing a posture for a travel in a viewing direction for the pixel and from a viewing position for the pixel; image data for a set of segments of the plurality of source images, a segment for a first source image comprising at least one pixel of the first source image for which a measure of prediction quality for a prediction of the segment of the plurality of combined images is below a threshold.
[18]
18. A computer program product comprising computer program code means adapted to cause a computer to carry out all of the steps of claim 15 or 16 when said program is executed on the computer.